Motivation: In the last few years, interest in biology has been shifting towards the problem of
optimal information extraction from the huge amount of data generated via large-scale and
high-throughput techniques. One of the most relevant issues has recently become that of correctly
and reliably predicting the functions of observed but still functionally undetermined proteins,
starting from information coming from the network of co-observed proteins of known function.
Method: The method proposed in this article is based on a message passing algorithm known
as Belief Propagation, which takes as input the network of physical protein interactions and a
catalog of known protein functions, and returns, for each unclassified protein, the probability of
having each given function. The implementation of the algorithm allows for fast on-line analysis
and can be easily generalized to more complex graph topologies, taking into account hyper-graphs,
i.e. complexes of more than two interacting proteins.

Results: The benchmark of our method is the Saccharomyces cerevisiae protein-protein interaction
(PPI) network, and the validity of our approach is successfully tested against other available
techniques.
I. INTRODUCTION

The most classical protein function prediction methods are those inferring similarity in function from sequence
homologies between proteins listed in databases, using programs such as FASTA [8] and BLAST [9]; via comparison with
known protein interactions in similar genomes (the so-called Rosetta Stone Method [10]); or by phylogenetic analysis [11].
More recently, a new class of methods has been proposed that relies on the available data on the global structure of the
PPI networks for a growing number of organisms with completely sequenced genomes. The most complete data available
on-line are structured in a graph-like format, with graph sites indexed by protein names and links representing
a physical, experimentally tested interaction between two proteins. More limited databases on larger protein complexes
are also available [14]. On the side of functional classification, databases are now available (MIPS and Gene
Ontology, among others) that provide a classification of a continuously growing number of proteins, listing them
in different functional categories with a hierarchical-like organization. Among the presently available methods
that try to exploit the global PPI network structure to infer yet unknown functions for unclassified proteins whose
interactions with the rest of the graph are at least partially known, there are the so-called Neighboring Counting
Method, the χ² Method, the Bayesian approaches, the Redundancy Method and a more recent Monte Carlo
Simulated Annealing (SA) approach.

II. METHODS 

Let us name Q a PPI graph, with set of vertices V = {1, ..., N} representing the observed proteins, each protein
name being assigned a numerical value from 1 to N. Let us also define a mapping between the set of all observed
functions and the numbered set T = {1, ..., F}. Each protein i belonging to V can then be characterized via a
discrete variable X_i that can take values f ∈ T. One would like to compute the probability P_i(f) = Pr(X_i = f) for
each protein to have a given function f, given the functions assigned to the proteins in the rest of the graph. The
method is based on the definition of a score function E on the PPI graph (see eq. (1)) that counts, over all
interactions, the number of common predicted functions among neighboring proteins of the graph. In addition to this, a
certain fraction of the proteins is already classified, which means that there exists a subset A ⊂ V of vertices with at
least one function belonging to T attached to each of them (see fig. 1(a) for an example of a graph portion). The effect
of the already classified proteins with a given function in the neighborhood of protein i on the PPI network is taken into
account as an external field acting on i and proportional to the number of neighbors belonging to A with that
given function. From this score function a variational potential (called Gibbs potential) can be defined that measures
the distance between the true unknown function probabilities and a trial estimate of them. The best estimates of the
probabilities are found by extremizing the Gibbs potential. The extremizing equations
used in this work are commonly known as Belief Propagation (BP) equations and can be easily derived
via a procedure called the Cavity Method. We have solved the BP equations both for the probabilities of completely
unclassified proteins belonging to V\A and for the more complete model where a protein belonging to A is allowed
to have other, yet unknown, functions. The modifications to be applied to the method are technically minor
and will not be described here.

Given a choice of initial conditions on the probability functions {P_i(X_i)}_{i=1,...,N}
and a choice of the score function E, the algorithm calculates the stationary probabilities whose values extremize
the resulting Gibbs potential. The potential in general depends on one free real parameter β that plays the role of
an inverse temperature and weights the possibility of allowing functional assignments that do not exactly maximize
the score but could still be possible due to their large degeneracy: at low enough values of β (high temperature)
almost any assignment of functions to the proteins in V\A gives an equivalent value of the potential. In this region the
system is said to be in a "paramagnetic phase": every functional assignment is accepted and the algorithm
is not predictive. Above a certain critical value β_c the shape of the Gibbs potential changes: only some values of the
probability functions extremize it. As β increases, the algorithm weights more and more those functional
assignments that exactly maximize the score. Strictly at zero temperature (β → ∞) only the score-maximizing
functional assignments survive with nonzero probability.

Given the sets V, A and V\A, the PPI graph Q, the graph of
unclassified proteins U ⊂ Q and the set of observed functions T, a score function can be defined following Vazquez et
al. as
where J_ij is the adjacency matrix of U (J_ij = 1 if i and j ∈ V\A and they interact with each other), δ(σ; τ) is the
Kronecker delta function measured between the functions σ and τ assigned to the neighboring proteins, and h_i(τ) is an
external field that counts the number of classified neighbors of protein i in the original graph Q that have at least
function τ. The Gibbs potential can then be calculated as a variational way to compute the quantity
called the free energy of the system, a fundamental quantity that in statistical physics is the logarithm of the sum of
the statistical weights with which each configuration of the variables in the system appears. Configurations with
the largest statistical weight can then be calculated as those maximizing this potential function. Using the message
passing approach, under the assumption that correlations in the graph are low enough that one can write
P_ij(X_i, X_j) ∝ P_i(X_i) P_j(X_j) if proteins i and j are chosen at random, one can calculate each P_j(X_j) as a product of
conditional probability contributions M_{i→j}(X_j) incoming to j from all neighbors i of protein j, conditional on the fact
that j has function X_j:
where I(j) ⊂ V\A denotes the set of unclassified neighbors of j and u_{i→j}(σ) is a "message" that represents the field
in direction σ ∈ T acting on protein j due to the presence of protein i when protein j has function σ. Equations for
the message functions can be solved iteratively as fixed points of the system of equations



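$$M_{i \to j}(X_j) \propto \sum_{\tau=1}^{F} e^{\beta [\delta(\tau; X_j) + h_i(\tau)]} \prod_{l \in I(i) \setminus j} M_{l \to i}(\tau) \qquad (4)$$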



one for each link of U, for both directions in the graph. The self-consistent BP equations can be rewritten in terms of
the messages u. The ones explicitly used in our algorithm are shown in the following:
FIG. 1: From Q to the BP equations: (a) shows a small fraction of the U network Q. Circles represent proteins with their
numerical ID used by the algorithm. Classified proteins are filled, while unclassified ones are left white. Each classified protein
has a series of functions whose numerical values ∈ T are written in boxes. In (b) only the corresponding part of the U subgraph
has been drawn. Dotted arrows represent external fields acting on the unclassified proteins and are vectors whose nonzero
components are defined in the lower boxes. For each protein in U, they count the number of classified neighbors having a
given function. Upper boxes are sets of all functions of all classified proteins neighboring a given unclassified one. Thick arrows
represent "messages" among unclassified proteins according to eq. (4). Notice how U is significantly less connected than Q and
often divides into smaller connected components. (c) is a more detailed representation of the message passing between proteins
i and j, in direction i → j.

The BP algorithm has been written in terms of that equation and solved at any β with a population dynamics
technique. In general, all previously described quantities depend on the inverse temperature β. Eq. (5) turns out
to be a good approximation of the solution of the problem of finding the probabilities for configurations maximizing
(2). A pictorial view of the iteration procedure is shown in fig. 1(c).
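
The iteration described above can be illustrated with a minimal Python sketch. It is not the population dynamics implementation used in this work: it assumes a statistical weight proportional to exp(β × score), where the score adds 1 for every link of U joining two proteins with the same function plus the external field h_i, and it iterates the messages by plain in-place updates. The function names (`external_fields`, `belief_propagation`) and the toy input are hypothetical.

```python
import math
from collections import defaultdict


def external_fields(edges, annotations, n_functions):
    """h[i][f]: number of classified neighbours of unclassified protein i with function f."""
    h = defaultdict(lambda: [0.0] * n_functions)
    for a, b in edges:
        for i, j in ((a, b), (b, a)):
            if i not in annotations and j in annotations:
                for f in annotations[j]:
                    h[i][f] += 1.0
    return h


def belief_propagation(edges, annotations, n_functions, beta, n_iter=500, tol=1e-10):
    """Return {protein: [P(f) for f in range(n_functions)]} for unclassified proteins."""
    h = external_fields(edges, annotations, n_functions)
    nodes = {p for e in edges for p in e if p not in annotations}
    nbrs = {p: set() for p in nodes}          # adjacency of the subgraph U
    for a, b in edges:
        if a in nodes and b in nodes:
            nbrs[a].add(b)
            nbrs[b].add(a)

    def local_weight(i, skip=None):
        # exp(beta * h_i(t)) times all incoming messages except the one from `skip`
        top = max(h[i])                       # shift for numerical stability
        w = [math.exp(beta * (x - top)) for x in h[i]]
        for l in nbrs[i]:
            if l != skip:
                w = [wt * m for wt, m in zip(w, msg[(l, i)])]
        return w

    # msg[(i, j)][s]: message from i to j, evaluated when j has function s
    msg = {(i, j): [1.0 / n_functions] * n_functions for i in nodes for j in nbrs[i]}
    gain = math.exp(beta) - 1.0               # extra weight when i and j share a function
    for _ in range(n_iter):
        change = 0.0
        for (i, j), old in list(msg.items()):
            w = local_weight(i, skip=j)
            total = sum(w)
            # sum over t of exp(beta * delta(s, t)) * w[t] = total + gain * w[s]
            new = [total + gain * w[s] for s in range(n_functions)]
            norm = sum(new)
            new = [v / norm for v in new]
            change = max(change, max(abs(u - v) for u, v in zip(new, old)))
            msg[(i, j)] = new
        if change < tol:
            break

    marginals = {}
    for i in nodes:
        w = local_weight(i)
        norm = sum(w)
        marginals[i] = [v / norm for v in w]
    return marginals


# Toy example (hypothetical data): x touches classified a, b, c; y touches only x.
edges = [("a", "x"), ("b", "x"), ("c", "x"), ("x", "y")]
annotations = {"a": {0}, "b": {0}, "c": {1}}
probs = belief_propagation(edges, annotations, n_functions=3, beta=2.0)
print({p: [round(v, 3) for v in probs[p]] for p in sorted(probs)})
```

On a tree such as this toy graph the updates reach a fixed point after a few sweeps; on graphs with cycles the same loop only gives an approximation, and convergence is not guaranteed by this sketch.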
TABLE I: Cluster size (cs) and number of clusters (noc) for both the original graph Q (the two leftmost columns) and the
graph of the unknown proteins U (the two rightmost columns). Note the giant component of 1299 sites for the Q graph.



III. DATA AND GRAPH ANALYSIS 



As benchmarks for the method, we have used two yeast Saccharomyces cerevisiae PPI graphs, referred to in the
following as U and D respectively. The set of functional categories T was extracted from the MIPS database. The U
network contains N = 1826 proteins, out of which 1370 belong to A while the remaining 456 are unclassified or have
an unclear MIPS classification, and M = 2238 pairwise interactions. The D network contains N = 4713 proteins, out
of which 3303 belong to A and 1410 are unclassified, and M = 14846 interactions. Different choices of the set of
functional categories T are possible in the MIPS database, depending on the level of coarse-graining of the
hierarchical classification scheme. We used the latest publicly available finest classification scheme, retrieving F = 165
functional categories present in U and F = 176 in D, but experiments were also run on the most coarse-grained
classification scheme. Results are available upon request.

The PPI graph consists of a giant component of 1299 sites (990 classified), and the rest of the sites are grouped into
184 smaller isolated components of at most 13 sites. We have also analyzed the structure of the V\A graph, which
consists of 456 sites grouped in 309 clusters of size at most 27. Each cluster in V\A can be considered as an isolated
functional island of the graph surrounded by external fields, as displayed in the last picture of the main body of the
paper. More details on the cluster composition of both Q and U for the U PPI network are shown in Table I. One
may wonder whether these clusters are more than a topological feature of our model and also reflect a more interesting
functional segregation. In other words, one is interested in understanding in quantitative terms how different clusters in
V\A label different functional areas in our graphs. To this aim we measured inter-cluster and intra-cluster functional
overlaps as in eqs. (6) and (7). Both observables take values in the interval (0, 1) and give a measure of the functional
similarity of clusters (higher values indicate higher similarity). The emerging scenario shows clear signs of segregation,
since the inter-cluster overlap distribution has support in the interval (0, 0.1) while the intra-cluster distribution has
support in the whole interval (0, 1). This test can be interpreted as a coherence test on the graph itself, and also on
the working hypothesis of our method, since segregation is tacitly assumed in the functional form of the score function,
where only first-neighbor interactions on the graph are taken into account. Let us define the notions of intra- and
inter-cluster functional overlap as
where the index i labels the different clusters C_i and runs between 1 and the total number of clusters C, N_i is the number
of sites in cluster C_i, φ(s_l, s_k) counts the number of functions that sites l and k have in common, Φ(C_i) is the number
of different functions acting onto cluster C_i, and Φ(C_i ∪ C_j) is the number of different functions acting onto the union
set C_i ∪ C_j. It is interesting to note that, according to Eqs. (6) and (7), both O_i and o_ij take real values in the interval
FIG. 2: Cumulative probability distribution of the intra/inter-cluster functional overlap, as defined in Eqs. (6) and (7), for the
U graph of the U network. Since the inter-cluster overlap turns out to be always lower than 0.085, the inter-cluster cumulative
probability distribution (solid line) saturates to 1 above this value. The intra-cluster overlap instead shows a clear sign of
segregation. Note that the sudden jump at 1 for the dashed line is due to a significant fraction of clusters (84 out of 309) with
functional overlap strictly equal to one.



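(0, 1). We can consider the probability densities of the two variables O_i and o_ij as

$$p^{(intra)}(O) = \frac{1}{C} \sum_{i=1}^{C} \delta(O, O_i) \qquad (8)$$

$$p^{(inter)}(o) = \frac{2}{C(C-1)} \sum_{1 \le i < j \le C} \delta(o, o_{ij}) \qquad (9)$$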

where δ(a, b) is the Kronecker delta function, equal to 1 when a = b and zero otherwise. It is interesting at this point
to compare the average intra-cluster overlap ⟨O⟩ = 0.440673 with the average inter-cluster overlap ⟨o⟩ = 0.0294147,
which is a factor 15 smaller and can be taken as a quantitative measure of the functional segregation on the PPI
graph.

We then define the cumulative distribution functions as
The two cumulative functions are displayed in Fig. 2. The algorithm can be run separately and in parallel on all
connected components of U, because there is no exchange of information between them. Equivalently, the
score function can be written as a sum over all components c of separate scores: E = Σ_c E_c({X_i}_{i∈c}).
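
Because the score decomposes over connected components, a natural preprocessing step is to split U into its components and treat each one independently. A minimal Python sketch of that step, using breadth-first search; the function name and the toy data are hypothetical and not taken from the original implementation:

```python
from collections import defaultdict, deque


def unclassified_components(edges, classified):
    """Connected components of the subgraph U induced by unclassified proteins."""
    nodes = {p for e in edges for p in e if p not in classified}
    adj = defaultdict(set)
    for a, b in edges:
        # keep only links whose two ends are both unclassified
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)

    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            v = queue.popleft()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(sorted(comp))
    return components


# Toy example: A and B are classified, so x-y and z end up in different islands.
edges = [("A", "x"), ("x", "y"), ("B", "z"), ("z", "A")]
components = unclassified_components(edges, classified={"A", "B"})
print(components)
```

Each returned component can then be handed to the inference step on its own, for instance in separate processes, since classified proteins enter only through the external fields.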
FIG. 3: (a) F1, (b) F2 and (c) Sharpness versus protein degree for different U dilutions, as described in the text. Results
are displayed for three dilution levels d = 0.4, d = 0.5 and d = 0.7. Dotted lines are results considering only functions of highest
probability (1st best rank). Dotted-dashed lines are results considering both best and second best ranks. Thick lines consider
all non-background-noise ranks. SA and NCM are the Simulated Annealing and the Neighboring Counting Method results
for dilution d = 0.4. Notice that a low value of Sharpness does not necessarily indicate a poor performance of the algorithm. It
could also be due to the fact that many functions have not yet been observed even in already classified proteins, and therefore
the catalogs are incomplete not only for proteins in U, but on all of Q.



IV. RESULTS 



We have run our algorithm, solving the BP equations at several values of β > β_c and for different choices of the initial
conditions of the populations {P_i(X_i)}_{i=1,...,N}. Results are always very stable with respect to initial conditions. Instead
of maximizing the Gibbs potential directly at zero temperature we have worked at finite β, because we were also interested
in predicting functional assignments that could be biologically allowed although not strictly maximizing the
score function (1). Above β_c, the function probabilities for each protein converge to a set of values organized in
hierarchies. The probability values are β-dependent, but the hierarchical structure is not (see fig. 5 for an example).
All results presented in the following are therefore taken at one given high value of β (β = 2 for the DIP PPI graph
and β = 10 for the U PPI graph). For any protein i and the connected component c that i belongs to, we have filtered out
all background noise probability values, i.e. those for functions that are not present in c and still have nonzero contributions
due to the form of eq. (2). We have then collected and ranked the remaining functional probabilities, following their
emerging hierarchical structure. A list of predicted functions for all the unclassified proteins in the U PPI network
using the MIPS 2003 functional categories catalog is presented in the Supplementary Table. The rank division is
explained in fig. 5. In order to probe the reliability of our algorithm, we have followed the standard procedure of
Vazquez et al.: starting from Q and a corresponding MIPS functional annotation, we disregarded the functions of
a given fraction d of classified proteins and considered them as unclassified. We have called d the "dilution" of Q. If a
previously classified protein is considered unclassified we say it has been "whitened". With this procedure one obtains
a new, larger graph U_d of unclassified proteins, on which the algorithm can be run and its ability to recover the
erased functions can be tested. This testing procedure is very similar to, but more stringent than, the Leave-One-Out
Method, because it treats as unclassified an extensive fraction of the proteins in the graph instead of only one at a
time. We repeated the procedure for both PPI networks. Results for a set of performance parameters are presented
FIG. 4: (d) F1, (e) F2 and (f) Sharpness versus dilution, averaged over the whole PPI network and over n = 10 random dilution
realizations. Thick lines are results for the D network, dotted lines for U. For each network we have again considered the 1st best,
the 1st and 2nd best, and all non-noise ranks results. The different spacing between lines when comparing the two networks reflects
their different topological structure. Proteins to be whitened were chosen randomly in A. The procedure was repeated n = 10 times
for each d (larger n datasets can be easily produced, but data are already very stable for n = 10), and the results averaged.
We disregarded as statistically non-significant the few observed proteins with k > 8.

in fig. 3 as a function of protein degree for some fixed dilution values. Fig. 3(a): Reliability. We have defined as a
first reliability parameter F1 the fraction of whitened proteins for which the algorithm correctly predicts at least one
function. Fig. 3(b) measures a second reliability parameter F2, defined as the fraction of correctly predicted functions
out of all functions a whitened protein has on the original graph Q. This test is more stringent because it checks the
ability of the algorithm to predict not only one function, but as many as it can. It is worth noticing that under
the F2 test the method still performs very well when all non-background-noise ranks are considered. The legend is
the same as in panel (a). Fig. 3(c): Sharpness. S measures the precision of the method and is defined as the
ratio of the number of correctly predicted functions to the number of all predicted ones. Intuitively, the
sharpness decreases with the number of probability levels (ranks) one accepts as significant. For whitened proteins of
degree 5, for instance, on average only 31% of predicted functions belong to the set of already known erased ones,
while in the case of best ranks only this percentage rises to 65%-70%. If one allows the algorithm to predict
still experimentally unobserved functions also on the classified proteins in Q, the sharpness decreases further. Results
as a function of network dilution are presented in fig. 4. When a direct comparison with other
methods on the same Q and MIPS catalog was possible (see fig. 3(a)), the results of our method were systematically better
than both the Neighboring Counting Method and the SA. Performance further improves if we consider not only highest rank
predictions, but all significant non-background-noise probabilities. Like the other available methods, BP
performs worse in predicting functions on leaves of the PPI graph, i.e. on whitened proteins with only one neighbor.
Nevertheless, even in this case we observed better reliability results.
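
The whitening test and the three performance parameters can be sketched in a few lines of Python. This is only an illustration under stated assumptions: F2 is computed here as the per-protein fraction of recovered functions averaged over whitened proteins, Sharpness is pooled over all predictions, every classified protein is assumed to carry at least one function, and the names `whiten` and `reliability_and_sharpness` as well as the toy data are hypothetical.

```python
import random


def whiten(annotations, d, rng):
    """Hide the functions of a random fraction d of the classified proteins."""
    proteins = sorted(annotations)
    hidden = set(rng.sample(proteins, round(d * len(proteins))))
    kept = {p: fs for p, fs in annotations.items() if p not in hidden}
    erased = {p: set(annotations[p]) for p in hidden}
    return kept, erased


def reliability_and_sharpness(erased, predicted):
    """Return (F1, F2, Sharpness) for predictions made on whitened proteins."""
    n = len(erased)
    hits = {p: len(erased[p] & predicted.get(p, set())) for p in erased}
    # F1: fraction of whitened proteins with at least one correct prediction
    f1 = sum(1 for p in erased if hits[p] > 0) / n
    # F2: average fraction of the erased functions that were recovered (assumed averaging)
    f2 = sum(hits[p] / len(erased[p]) for p in erased) / n
    # Sharpness: correct predictions over all predictions
    n_predicted = sum(len(predicted.get(p, set())) for p in erased)
    sharpness = sum(hits.values()) / n_predicted if n_predicted else 0.0
    return f1, f2, sharpness


# Toy check with hand-written predictions (hypothetical data).
erased = {"p": {1, 2}, "q": {3}}
predicted = {"p": {1, 4}, "q": {5}}
print(reliability_and_sharpness(erased, predicted))

# Whitening 40% of ten classified proteins.
annotations = {f"prot{i}": {i % 3} for i in range(10)}
kept, hidden = whiten(annotations, d=0.4, rng=random.Random(0))
print(len(kept), len(hidden))
```

In a full test, `predicted` would hold the functions retained for each whitened protein after rank selection, so the same routine covers the best-rank, two-rank and all-rank curves.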
FIG. 5: Example of predicted probability ranks for protein YDR386W in the MIPS catalog and for the U PPI network. In
this example, out of all 140 possible functions only the ones with a vertical bar have non-background-noise probabilities. Bar
heights (ranks) are proportional to the logarithm of the probability of having a given function, for all the functions ordered on
the horizontal axis.

V. DISCUSSION 

Hierarchical probabilities structure: Let us consider as a simple example a protein i surrounded by 3 classified
neighbors, two having function σ and one having function τ. According to eq. (5), in the zero temperature (β → ∞) limit
one has P_i(σ) = 1 and P_i(τ) = 0, together with all other function probabilities. However, if the interaction between
protein i and the one neighbor with function τ is correct, a biologically more sound functional assignment would be
that of giving protein i both functions. Working at finite β, one can see again from eq. (5) that a nonzero value of
P_i(τ) is also found. The numerical value depends continuously on β. The hierarchy of the values of the predicted
probabilities nevertheless turns out to be very stable after crossing the critical point β_c. One example of probabilities
at convergence at a given β and for a randomly chosen protein is given in fig. 5.
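
The three-neighbor example can be made concrete with a short Python sketch. It assumes that protein i has no unclassified neighbors, so that its probabilities depend only on the external field, P_i(f) ∝ exp(β h_i(f)); this closed form is our reading of the model under that assumption, and the helper name `isolated_marginal` is hypothetical.

```python
import math


def isolated_marginal(h, beta):
    """P(f) proportional to exp(beta * h[f]) for a protein with classified neighbours only."""
    top = max(h)                                  # shift to avoid overflow at large beta
    w = [math.exp(beta * (x - top)) for x in h]
    norm = sum(w)
    return [x / norm for x in w]


# Index 0: function sigma (two neighbours), index 1: function tau (one neighbour),
# index 2: a function carried by no neighbour.
h = [2, 1, 0]
for beta in (0.5, 2.0, 10.0):
    p = isolated_marginal(h, beta)
    print(beta, [round(x, 4) for x in p])
```

Under this assumption the ratio P_i(τ)/P_i(σ) equals exp(-β): the numerical values change with β, while the ordering σ, τ, unobserved function stays the same, which is the hierarchy discussed above.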

Extension of the algorithm from the unclassified proteins graph U only to all proteins in Q: Looking
at four subsequent MIPS database releases (2001, 2002, 2003 and 2004 respectively), one can see that new
functions are progressively assigned to already classified proteins too, so that an inference procedure that allows for
this possibility is in principle more complete. However, this procedure can lead to a spreading in the values of the inferred
probabilities, with a loss in Sharpness. Indeed, we tested the performance of our algorithm in both the general and
the restricted case, without noticing any significant difference in performance. In fact, since Q is significantly more
connected than U, the algorithm is significantly slower in reaching convergence of the probability values (it still requires
one single run, being therefore faster than the SA approach). Nevertheless, the possibility that the more general case
would work significantly better under the definition of a more refined score function, or with a more complete and
reliable PPI network, or with the extension of the method to multi-body interactions taking into consideration larger
protein complexes, cannot be ruled out. Results shown in the body of the text have been limited for clarity to the
restricted case where inference is measured only on proteins ∈ V\A and for the 2002 and 2003 MIPS catalogs, in
order to compare them with results already present in the literature. The same algorithm could be run on the latest
2004 MIPS catalog release with no effort.

Comparison with other available methods. Differently from SA, the BP algorithm allows one to compute directly and
in a single run all probabilities P_i(f) for a given protein i to be assigned a function f. This is an advantage with
respect to the SA approach, where the output of a single run is one configuration only out of a mutually exclusive set,
and in order to obtain trustworthy probabilities one should average over a large number of SA runs. Moreover, provided
one can trust the hierarchies of the numerical BP results at convergence, some non-ground-state configurations that could
have a biologically sound interpretation (see Methods for details) are captured in the BP approach in a hierarchical way,
while they are missed by SA unless one has time to run a number of cooling experiments of the order of 10^6 (compare with
the 10^2 runs reported in Vazquez et al.). Differently from Kasif et al., our version of BP naturally converges and
therefore does not need truncation of the iterations. The connection of the computed probability values with the real unknown
ones can be made only at convergence of the BP iteration equations, and it is not clear how to interpret the probability
values after only a limited number (two in Kasif et al.) of iteration steps, when one could still be in the middle of
a transient still heavily dependent on initial conditions. Moreover, truncating the iteration after a small number of
steps means disregarding the propagation of information coming from distant regions of the network, which is the spirit of
any message passing algorithm like BP. The method could still in principle work if the most distant message passing
nodes of any chosen node i ∈ V\A were a few neighboring steps away. This turns out to be almost the case for the
considered PPI networks, due to the high clusterization and function segregation of V\A, as described in Methods, but
it is not generally true in inference problems. In a second Bayesian approach [5], a large number of external parameters
(one set for each function) has to be estimated before running the Bayesian inference algorithm. Still, the Gibbs
potential [5] could in principle be of a more complete form, allowing for the presence of chemical-potential-like terms
(one for each function) proportional to the overall number of times one function is present in the graph. However, it
is not clear how reliable the biological significance of a term of this type is, since influence from the classified
functions of distant proteins should already be taken into account in the structure of the message passing procedure.
Moreover, if the property of functional segregation were true also on the complete (still unknown) PPI network (see
Barabasi et al. for some indications that this might be the case), it is not clear why a protein should have a high
probability of being classified with a certain function only because a large group of proteins with a very frequent
function existed, even if not interacting with the protein under consideration. In addition, our BP method does not
require keeping track of single configurations of functions during the iterations, but only directly of probability weights.
The algorithm converges to a stable fixed point and does not need the definition of a measuring time window.
Like the Monte Carlo approach, our algorithm does not need previous estimation of external parameters
defining the Gibbs potential, except for the overall tuning inverse temperature β.

Limitations. Our method of course has many limitations. (1) The uncertainty over the graph structure, due to
the presence of false positive and false negative interactions. The network topology could vary greatly, and the network,
instead of being divided into connected components, could be made of essentially only one giant component. The degree
distribution of the network could vary, even though some authors suggest there is evidence for a stabilization towards
a scale-free-like form. Attempts at healing PPI network errors or missing links are described in the literature [22,23],
together with a general description of a message passing approach to network reconstruction [19]. Our algorithm could
be generalized to partially deal in parallel with these problems, considering two sets of dynamical variables {X_i}
and {J_ij} instead of {X_i} only. Each J_ij could then take values in a discrete set measuring the likelihood of the
interaction between proteins i and j being present, as a function of the reliability of the experimental data and of the
predicted functions assigned to the proteins under consideration. The extrapolated set {J_ij} could then be taken as a
starting point to calculate new function probabilities over the whole graph using again the BP procedure. Extensions
of the method in this direction are under study but are not presented in this paper. (2) Pairwise interactions in the
observed PPI graph could hide a more complex hyper-graph-like structure, with more than two proteins interacting
through protein complexes. Our algorithm is readily generalizable to these cases, but we have not yet tried to test it on
actual data. (3) The way the BP algorithm predicts functions on classified proteins is intrinsically different from
the way new annotations are listed in the growing catalogs: to the authors' understanding, experiments are typically
run concentrating on one or a limited number of interesting functions, while other functions could be disregarded. The
predictions of the inference algorithm treat all functions in the same way, so the frequencies of predicted functions could
differ significantly from the experimentally observed ones. (4) The numerical values of the probabilities can be proved to
be correct (for a given score function) only if the PPI graph is strictly a tree. This is not the case for the
experimental data, where cycles are present. However, we believe that the order of magnitude of the P_i(X_i) values
should be trustworthy. This approximation is one of the sources of error in the results of fig. 3. (5) The PPI graph is
usually built as a time and space average of all processes going on within the cell: a given protein classified for instance
with functions 1 and 2 could in principle interact with two other proteins at different times in the cell cycle and/or in
different places. One of the neighboring proteins could then take common function 1, while the other could take common
function 2, in a perfectly sound configuration. Running the BP algorithm on the averaged graph would however lead
to the prediction of both functions on both neighbors. In this way the algorithm would lose predictive power and
sharpness; however, it would still predict the correct functions with a certain probability, whose exact numerical values
should again be taken "cum grano salis", as already said in point (4). (6) Information from different databases should
be merged in a proper way: for instance, the U and D PPI graphs are significantly different both in overlap and in
topological structure (this point is strictly connected with (1)). We have decided to apply the function prediction
method to both graphs separately, but in principle it could be used on a merging of different networks that properly
weights the relative reliability of interaction links. (7) The Kronecker delta function defines a binary distance between
functions: one link in U contributes 1 to the total score only if the interacting proteins have exactly the same function,
and 0 otherwise. However, the MIPS classification scheme is organized hierarchically: some proteins have very specific
functions, while others can be classified only in more coarse-grained functional categories. The choice of a binary
distance is probably appropriate if one considers only functional categories at a given hierarchical level, but it seems
unsatisfactory for the total classification, where a more complete notion of hierarchical distance between different
functional categories would be needed. In particular, one would like to have a distance that recognizes as possibly
close two neighboring proteins of functions σ and τ in the case where σ belongs to a very specific functional category, while τ
belongs only to a broader one that includes the first, but with no further specification. In this paper we have limited
the method to the binary distance score function, considering only functions at a chosen hierarchical level in the
MIPS catalogs and disregarding all the others. In this way some information on partial knowledge of the functions
assigned to a given protein is lost. This limitation becomes particularly dramatic in the case of catalogs
that are not organized hierarchically but in a more complex way, such as Gene Ontology. Extensions of this method
are under study.